Abstract

This paper presents a computation of the $V_\gamma$ dimension for regression in bounded subspaces
of Reproducing Kernel Hilbert Spaces (RKHS) for the Support Vector Machine (SVM)
regression $\epsilon$-insensitive loss function, and general $L_p$ loss functions. Finiteness of the $V_\gamma$
dimension is shown, which also proves uniform convergence in probability for regression
machines in RKHS subspaces that use the $L_\epsilon$ or general $L_p$ loss functions. This paper
presents a novel proof of this result also for the case that a bias is added to the functions
in the RKHS.
1 Introduction

In this paper we present a computation of the $V_\gamma$ dimension of real-valued functions
$L(y, f(x)) = |y - f(x)|^p$ and (Vapnik's $\epsilon$-insensitive loss function [10]) $L(y, f(x)) = |y - f(x)|_\epsilon$,
with $f$ in a bounded sphere in a Reproducing Kernel Hilbert Space (RKHS). We show that the
$V_\gamma$ dimension is finite for these loss functions, and compute an upper bound on it. We present
two solutions to the problem. First we discuss a simple argument which leads to a loose upper
bound on the $V_\gamma$ dimension. Then we refine the result in the case of infinite dimensional RKHS,
which is often the type of hypothesis space considered in the literature (e.g. Radial Basis
Functions [8, 5]). Our result applies to standard regression learning machines such as
Regularization Networks (RN) and Support Vector Machines (SVM). We also present a novel
computation of the $V_\gamma$ dimension when a bias is added to the functions, that is with $f$ being of
the form $f = f_0 + b$, where $b \in R$ and $f_0$ is in a sphere in an infinite dimensional RKHS.

For a regression learning problem using $L$ as a loss function it is known [1] that finiteness of
the $V_\gamma$ dimension for all $\gamma > 0$ is a necessary and sufficient condition for uniform convergence in
probability [10]. So the results of this paper have implications for uniform convergence both for
RN and for SVM regression [4].

Previous related work addressed the problem of pattern recognition where $L$ is an indicator
function [?, 6]. The fat-shattering dimension [1] was considered instead of the $V_\gamma$ one. A different
approach to proving uniform convergence for RN and SVM is given in [12] where covering number
arguments using entropy numbers of operators are presented. In both cases, regression as well
as the case of non-zero bias $b$ were only marginally considered.

The paper is organized as follows. Section 2 outlines the background and motivation of this work. 
The reader familiar with statistical learning theory and RKHS can skip this section. Section 3 
presents a simple proof of the results as well as some upper bounds on the $V_\gamma$ dimension. Section
4 presents a refined computation in the case of infinite dimensional RKHS, as well as the case
where the hypothesis space consists of functions of the form $f = f_0 + b$ where $b \in R$ and $f_0$ is in a
sphere in an RKHS. Finally, Section 5 discusses possible extensions of this work.


2 Background and Motivation 

We consider the problem of learning from examples as it is viewed in the framework of statistical
learning theory [10]. We are given a set of $l$ examples $\{(x_1, y_1), \ldots, (x_l, y_l)\}$ generated by randomly
sampling from a space $X \times Y$ with $X \subset R^d$, $Y \subset R$ according to an unknown probability
distribution $P(x, y)$. Throughout the paper we assume that $X$ and $Y$ are bounded. Using this
set of examples the problem of learning consists of finding a function $f : X \to Y$ that can be
used given any new point $x \in X$ to predict the corresponding value $y$.

The problem of learning from examples is known to be ill-posed [10, 9]. A classical way to solve it
is to perform Empirical Risk Minimization (ERM) with respect to a certain loss function, while
restricting the solution to the problem to be in a "small" hypothesis space [10]. Formally this
means minimizing the empirical risk $I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i))$ with $f \in \mathcal{H}$, where $L$ is the
loss function measuring the error when we predict $f(x)$ while the actual value is $y$, and $\mathcal{H}$ is a
given hypothesis space.

In this paper, we consider hypothesis spaces of functions which are hyperplanes in some feature
space:

$$f(x) = \sum_{n=1}^{\infty} w_n \phi_n(x) \qquad (1)$$

with:

$$\sum_{n=1}^{\infty} \frac{w_n^2}{\lambda_n} < \infty \qquad (2)$$

where $\phi_n(x)$ is a set of given, linearly independent basis functions, $\lambda_n$ are given non-negative
constants such that $\sum_{n=1}^{\infty} \lambda_n^2 < \infty$. Spaces of functions of the form (1) can also be seen as
Reproducing Kernel Hilbert Spaces (RKHS) [2, 11] with kernel $K$ given by:

$$K(x, y) = \sum_{n=1}^{\infty} \lambda_n \phi_n(x) \phi_n(y). \qquad (3)$$

For any function $f$ as in (1), quantity (2) is called the RKHS norm of $f$, $\|f\|_K^2$, while the number
$D$ of features $\phi_n$ (which can be finite, in which case all sums above are finite) is the dimensionality
of the RKHS.
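
As a small numerical illustration of (1)-(3), the following sketch uses a hypothetical three-feature RKHS (monomial features and hand-picked constants $\lambda_n$, both assumptions made only for this example). It checks that for a kernel expansion $f(x) = \sum_i \alpha_i K(x_i, x)$ the sum of $w_n^2 / \lambda_n$ coincides with $\alpha^T G \alpha$, where $G_{ij} = K(x_i, x_j)$; this is the expression for the norm used in the arguments that follow.

```python
# Hypothetical finite-dimensional RKHS: features phi_n(x) = x**n, n = 0, 1, 2,
# with hand-picked positive constants lambda_n (assumptions for illustration).
LAMBDAS = [1.0, 0.5, 0.25]


def features(x):
    return [x ** n for n in range(len(LAMBDAS))]


def kernel(x, y):
    # K(x, y) = sum_n lambda_n * phi_n(x) * phi_n(y), as in equation (3).
    return sum(lam * a * b for lam, a, b in zip(LAMBDAS, features(x), features(y)))


def weights_from_alpha(points, alpha):
    # f(x) = sum_i alpha_i K(x_i, x) has w_n = lambda_n * sum_i alpha_i phi_n(x_i).
    return [lam * sum(a * features(p)[n] for a, p in zip(alpha, points))
            for n, lam in enumerate(LAMBDAS)]


def norm_from_weights(w):
    # RKHS norm written in terms of the weights: sum_n w_n^2 / lambda_n.
    return sum(wn ** 2 / lam for wn, lam in zip(w, LAMBDAS))


def norm_from_gram(points, alpha):
    # The same norm written as alpha^T G alpha with G_ij = K(x_i, x_j).
    return sum(ai * aj * kernel(pi, pj)
               for ai, pi in zip(alpha, points)
               for aj, pj in zip(alpha, points))


points = [-1.0, 0.5, 2.0]
alpha = [0.3, -1.2, 0.7]
w = weights_from_alpha(points, alpha)
print(norm_from_weights(w), norm_from_gram(points, alpha))
```

The two printed numbers agree up to rounding, which is the finite-dimensional version of the identity between the weight form and the Gram-matrix form of the norm.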

If we restrict the hypothesis space to consist of functions in a RKHS with norm less than a
constant $A$, the general setting of learning discussed above becomes:

$$\text{Minimize:} \quad \frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i))$$

$$\text{subject to:} \quad \|f\|_K^2 \le A^2. \qquad (4)$$

An important question for any learning machine of the type (4) is whether it is consistent: as
the number of examples $(x_i, y_i)$ goes to infinity the expected error of the solution of the machine
should converge in probability to the minimum expected error in the hypothesis space [10, 3]. In
the case of learning machines performing ERM in a hypothesis space (4), consistency is shown to
be related to uniform convergence in probability [10], and necessary and sufficient conditions
for uniform convergence are given in terms of the $V_\gamma$ dimension of the hypothesis space considered
[1, 7], which is a measure of complexity of the space.

In statistical learning theory typically the measure of complexity used is the VC-dimension.
However, as we show below, the VC-dimension in the above learning setting in the case of
infinite dimensional RKHS is infinite both for $L_p$ and $L_\epsilon$, so it cannot be used to study learning
machines of the form (4). Instead one needs to consider other measures of complexity, such as
the $V_\gamma$ dimension, in order to prove uniform convergence in infinite dimensional RKHS. We now
present some background on the $V_\gamma$ dimension [1].

In the case of indicator functions the definition of the $V_\gamma$ dimension is:

Definition 2.1 The $V_\gamma$-dimension of a set of indicator functions $\{\theta(f(x)), f \in \mathcal{H}\}$ (where $\theta$ is
the Heaviside function), is the maximum number $h$ of vectors $x_1, \ldots, x_h$ that can be separated
into two classes in all $2^h$ possible ways using functions of the set and the rules:

class 1 if: $f(x) \ge s + \gamma$
class -1 if: $f(x) \le s - \gamma$

for some $s \ge 0$. If, for any number $N$, it is possible to find $N$ points $x_1, \ldots, x_N$ that can be
separated in all the $2^N$ possible ways, we will say that the $V_\gamma$-dimension of the set is infinite.

For $\gamma = s = 0$ this becomes the definition of the VC dimension of a set of indicator functions
$\theta(f(x))$ [10]. Moreover, in the case of hyperplanes (1), the $V_\gamma$ dimension has also been referred
to in the literature [10] as the VC dimension of hyperplanes with margin. In order to avoid
confusion with names, we refer to the VC dimension of hyperplanes with margin as the $V_\gamma$ dimension
of hyperplanes (for appropriate $\gamma$ depending on the margin, as discussed below). The VC
dimension has been used to bound the growth function $G^{\mathcal{H}}(l)$. This function measures the
maximum number of ways we can separate $l$ points using functions from hypothesis space $\mathcal{H}$. If
$h$ is the VC dimension, then $G^{\mathcal{H}}(l)$ is $2^l$ if $l \le h$, and $\le (\frac{el}{h})^h$ otherwise [10].
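
To get a feel for how the growth function behaves once the number of points exceeds the VC dimension, the bound quoted from [10] (equal to $2^l$ up to $l = h$, and at most $(el/h)^h$ beyond) can be tabulated. The function name below is hypothetical and the snippet is only a sketch of that standard bound.

```python
import math


def growth_function_bound(l, h):
    """Upper bound on the growth function for l points and VC dimension h."""
    if l <= h:
        # Up to h points every one of the 2^l separations may be realized.
        return 2.0 ** l
    # Beyond h points the number of separations grows only polynomially in l.
    return (math.e * l / h) ** h


for l in (2, 5, 10, 20, 40):
    print(l, growth_function_bound(l, h=5), 2.0 ** l)
```

For $l$ well above $h$ the bound falls far below $2^l$, which is what makes it possible to rule out shattering of large point sets.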

In the case of real valued functions (which is the case we are interested in here) the definition of
the $V_\gamma$ dimension is:

Definition 2.2 Let $C \le L(y, f(x)) \le B$, $f \in \mathcal{H}$, with $C$ and $B < \infty$. The $V_\gamma$-dimension
of $L$ in $\mathcal{H}$ (of the set $\{L(y, f(x)), f \in \mathcal{H}\}$) is defined as the maximum number $h$ of vectors
$(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated into two classes in all $2^h$ possible ways using rules:

class 1 if: $L(y_i, f(x_i)) \ge s + \gamma$
class -1 if: $L(y_i, f(x_i)) \le s - \gamma$

for $f \in \mathcal{H}$ and some $C + \gamma \le s \le B - \gamma$. If, for any number $N$, it is possible to find $N$
points $(x_1, y_1), \ldots, (x_N, y_N)$ that can be separated in all the $2^N$ possible ways, we will say that the
$V_\gamma$-dimension of $L$ in $\mathcal{H}$ is infinite.

For $\gamma = 0$ and for $s$ being free to change values for each separation of the data, this becomes the
VC dimension of the set of functions [10].
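
For a finite collection of candidate functions, the separation rules of Definition 2.2 can be checked by brute force. The sketch below assumes the loss values have already been tabulated (one row per function, one column per point) and that a single threshold $s$ is used for all separations; the function name and the toy table are hypothetical.

```python
def is_gamma_shattered(loss_table, gamma, s):
    """Check whether the tabulated functions separate the points in all ways.

    loss_table[k][i] holds L(y_i, f_k(x_i)) for function k and point i.
    """
    n_points = len(loss_table[0])
    realized = set()
    for row in loss_table:
        labels = []
        for value in row:
            if value >= s + gamma:
                labels.append(1)       # class 1
            elif value <= s - gamma:
                labels.append(-1)      # class -1
            else:
                labels = None          # value falls inside the margin
                break
        if labels is not None:
            realized.add(tuple(labels))
    return len(realized) == 2 ** n_points


# Toy table: four functions evaluated on two points.
table = [[0.9, 0.9], [0.9, 0.1], [0.1, 0.9], [0.1, 0.1]]
print(is_gamma_shattered(table, gamma=0.3, s=0.5))
print(is_gamma_shattered(table, gamma=0.45, s=0.5))
```

With the wider margin the values 0.9 no longer reach $s + \gamma$, so the same table stops shattering the two points; this is the sense in which a larger $\gamma$ makes the dimension smaller.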

Using the $V_\gamma$ dimension Alon et al. [1] gave necessary and sufficient conditions for uniform
convergence in probability to take place in a hypothesis space $\mathcal{H}$. In particular they proved the
following important theorem:

Theorem 2.1 (Alon et al., 1997) Let $C \le L(y, f(x)) \le B$, $f \in \mathcal{H}$, $\mathcal{H}$ be a set of bounded
functions. The ERM method uniformly converges (in probability) if and only if the $V_\gamma$ dimension
of $L$ in $\mathcal{H}$ is finite for every $\gamma > 0$.

It is clear that if for learning machines of the form (4) the $V_\gamma$ dimension of the loss function $L$
in the hypothesis space defined is finite for $\forall \gamma > 0$, then for these machines uniform convergence
takes place. In the next section we present a simple proof of the finiteness of the $V_\gamma$ dimension,
as well as an upper bound on it.

2.1 Why not use the VC-dimension 

Consider first the case of $L_p$ loss functions. Consider an infinite dimensional RKHS, and the set
of functions with norm $\|f\|_K^2 \le A^2$. If for any $N$ we can find $N$ points that we can shatter using
functions of our set according to the rule:

class 1 if: $|y - f(x)|^p \ge s$
class -1 if: $|y - f(x)|^p \le s$

then clearly the VC dimension is infinite. Consider $N$ points $(x_i, y_i)$ with $y_i = 0$ for all $i$, and
$x_i$ be such that the smallest eigenvalue of matrix $G$ with $G_{ij} = K(x_i, x_j)$ is $\ge \lambda$. Since we are
in infinite dimensional RKHS, matrix $G$ is always invertible [11], so $\lambda > 0$ since $G$ is positive
definite.

For any separation of the points, we consider a function $f$ of the form $f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)$,
which is a function of the form (1). We need to show that we can find coefficients $\alpha_i$ such that
the RKHS norm of the function is $\le A^2$. Notice that the norm of a function of this form is
$\boldsymbol{\alpha}^T G \boldsymbol{\alpha}$ where $(\boldsymbol{\alpha})_i = \alpha_i$ (throughout the paper bold letters are used for noting vectors).
Consider the set of linear equations

$x_j \in$ class 1: $\sum_{i=1}^{N} \alpha_i G_{ij} = s^{1/p} + \eta$, $\eta > 0$
$x_j \in$ class -1: $\sum_{i=1}^{N} \alpha_i G_{ij} = s^{1/p} - \eta$, $\eta > 0$

Let $s = 0$. If we can find a solution $\boldsymbol{\alpha}$ to this system of equations such that $\boldsymbol{\alpha}^T G \boldsymbol{\alpha} \le A^2$ we can
perform this separation, and since this is any separation we can shatter the $N$ points. Notice
that the solution to the system of equations is $G^{-1} \boldsymbol{\eta}$ where $\boldsymbol{\eta}$ is the vector whose components
are $(\boldsymbol{\eta})_i = \eta$ when $x_i$ is in class 1, and $-\eta$ otherwise. So we need
$(G^{-1} \boldsymbol{\eta})^T G (G^{-1} \boldsymbol{\eta}) \le A^2 \Rightarrow \boldsymbol{\eta}^T G^{-1} \boldsymbol{\eta} \le A^2$. Since the smallest eigenvalue of $G$ is $\ge \lambda > 0$,
$\boldsymbol{\eta}^T G^{-1} \boldsymbol{\eta} \le \frac{\boldsymbol{\eta}^T \boldsymbol{\eta}}{\lambda}$. Moreover $\boldsymbol{\eta}^T \boldsymbol{\eta} = N \eta^2$. So if we choose $\eta$ small enough such that
$\frac{N \eta^2}{\lambda} \le A^2 \Rightarrow \eta^2 \le \frac{A^2 \lambda}{N}$, the norm of the
solution is less than $A^2$, which completes the proof.
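
The construction above can be checked numerically. The following is a minimal sketch, assuming a one-dimensional Gaussian kernel $K(x, x') = \exp(-(x - x')^2)$ and well separated points; the kernel, the Gershgorin lower bound used as $\lambda$, and all function names are choices made for this example only. For every one of the $2^N$ labelings it solves $G \alpha = \pm\eta$ with $\eta^2 = A^2 \lambda / N$ and confirms that the function takes the prescribed values while $\alpha^T G \alpha \le A^2$.

```python
import math
from itertools import product


def gaussian_gram(points):
    return [[math.exp(-(a - b) ** 2) for b in points] for a in points]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def shatters_with_bounded_norm(points, A):
    G = gaussian_gram(points)
    n = len(points)
    # Gershgorin: every eigenvalue of G is at least 1 minus the largest
    # off-diagonal row sum (the diagonal of G is 1 for this kernel).
    lam = 1.0 - max(sum(G[i][j] for j in range(n) if j != i) for i in range(n))
    if lam <= 0:
        raise ValueError("points too close for this simple eigenvalue bound")
    eta = math.sqrt(A ** 2 * lam / n)
    for labels in product([1, -1], repeat=n):
        target = [eta * c for c in labels]
        alpha = solve(G, target)
        # f(x_i) = sum_j alpha_j K(x_j, x_i), i.e. the i-th entry of G alpha.
        values = [sum(G[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        norm_sq = sum(a * v for a, v in zip(alpha, values))  # alpha^T G alpha
        if norm_sq > A ** 2 + 1e-9:
            return False
        if any(abs(v - t) > 1e-9 for v, t in zip(values, target)):
            return False
    return True


print(shatters_with_bounded_norm([0.0, 2.0, 4.0, 6.0], A=1.0))
```

The same check goes through for any number of sufficiently separated points, which is the numerical counterpart of the statement that the norm bound $A^2$ alone does not limit the number of points that can be shattered.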

For the case of the $L_\epsilon$ loss function the argument above can be repeated with $y_i = \epsilon$ to prove
again that the VC dimension is infinite in an infinite dimensional RKHS.

Finally, notice that the same proof can be repeated for finite dimensional RKHS to show that
the VC dimension is never less than the dimensionality $D$ of the RKHS, since it is possible to
find $D$ points for which matrix $G$ is invertible and repeat the proof above. As a consequence the
VC dimension cannot be controlled by $A^2$.


3 A simple upper bound on the $V_\gamma$ dimension

Below we always assume that data $X$ are within a sphere of radius $R$ in the feature space defined
by the kernel $K$ of the RKHS. Without loss of generality, we also assume that $y$ is bounded
between $-1$ and $1$. Under these assumptions the following theorem holds:

Theorem 3.1 The $V_\gamma$ dimension $h$ for regression using $L_p$ or $L_\epsilon$ loss functions for hypothesis
spaces $\mathcal{H}_A = \{f(x) = \sum_{n=1}^{\infty} w_n \phi_n(x) \mid \sum_{n=1}^{\infty} \frac{w_n^2}{\lambda_n} \le A^2\}$ and $y$ bounded, is finite for $\forall \gamma > 0$. If $D$
is the dimensionality of the RKHS, then $h \le O(\min(D, \frac{(R^2+1)(A^2+1)}{\gamma^2}))$.

Proof. Let's consider first the case of the $L_1$ loss function. Let $B$ be the upper bound on the
loss function. From definition 2.2 we can decompose the rules for separating points as follows:

class 1 if:   $y_i - f(x_i) \ge s + \gamma$
or            $f(x_i) - y_i \ge s + \gamma$
class -1 if:  $y_i - f(x_i) \le s - \gamma$
and           $f(x_i) - y_i \le s - \gamma$          (5)

for some $\gamma \le s \le B - \gamma$. For any $N$ points, the number of separations we can get using rules
(5) is not more than the number of separations we can get using the product of two indicator
functions (of hyperplanes with margin):

function (a):  class 1 if:   $y_i - f_1(x_i) \ge s_1 + \gamma$
               class -1 if:  $y_i - f_1(x_i) \le s_1 - \gamma$
function (b):  class 1 if:   $f_2(x_i) - y_i \ge s_2 + \gamma$
               class -1 if:  $f_2(x_i) - y_i \le s_2 - \gamma$          (6)

where $f_1$ and $f_2$ are in $\mathcal{H}_A$, $\gamma \le s_1, s_2 \le B - \gamma$. For $s_1 = s_2 = s$ and for $f_1 = f_2 = f$ we recover
(5): for example, if $y - f(x) \ge s + \gamma$ then indicator function (a) will give $-1$, indicator function
(b) will give also $-1$, so their product will give $+1$ which is what we get if we follow (5). So since
we give more freedom to $f_1, f_2, s_1, s_2$ clearly we can get at least as many separations for any set
of points as the number of separations we would get using (5).

As mentioned in the previous section, for any $N$ points the number of separations is bounded
by the growth function. Moreover, for products of indicator functions it is known [10] that the
growth function is bounded by the product of the growth functions of the indicator functions.
Furthermore, the indicator functions in (6) are hyperplanes with margin in the $D + 1$ dimensional
space of vectors $\{\phi_n(x), y\}$ where the radius of the data is $R^2 + 1$, the norm of the hyperplane
is bounded by $A^2 + 1$ (where in both cases we add 1 because of $y$), and the margin is at
least $\frac{\gamma^2}{A^2+1}$. The $V_\gamma$ dimension $h_\gamma$ of these hyperplanes is known [10, ?] to be bounded by
$h_\gamma \le \min((D + 1) + 1, \frac{(R^2+1)(A^2+1)}{\gamma^2})$. So the growth function of the separating rules (5) is bounded by
$G(l) \le (\frac{el}{h_\gamma})^{h_\gamma} (\frac{el}{h_\gamma})^{h_\gamma}$ whenever $l \ge h_\gamma$. If $h_\gamma^{reg}$ is the $V_\gamma$ dimension, then $h_\gamma^{reg}$ cannot be larger
than the largest number $l$ for which inequality $2^l \le (\frac{el}{h_\gamma})^{h_\gamma} (\frac{el}{h_\gamma})^{h_\gamma}$ holds. From this we get that
$l \le 5 h_\gamma$, therefore $h_\gamma^{reg} \le 5 \min(D + 2, \frac{(R^2+1)(A^2+1)}{\gamma^2})$ which proves the theorem for the case of $L_1$
loss functions.

For general $L_p$ loss functions we can follow the same proof where (5) now needs to be rewritten
as:

class 1 if:   $y_i - f(x_i) \ge (s + \gamma)^{1/p}$
or            $f(x_i) - y_i \ge (s + \gamma)^{1/p}$
class -1 if:  $y_i - f(x_i) \le (s - \gamma)^{1/p}$
and           $f(x_i) - y_i \le (s - \gamma)^{1/p}$          (7)

Moreover, for $p > 1$, $(s + \gamma)^{1/p} \ge s^{1/p} + \frac{\gamma}{pB}$ (since
$\gamma = ((s + \gamma)^{1/p})^p - (s^{1/p})^p \le ((s + \gamma)^{1/p} - s^{1/p})(pB)$)
and $(s - \gamma)^{1/p} \le s^{1/p} - \frac{\gamma}{pB}$ (similarly). Repeating the same argument as above, we get that the $V_\gamma$
dimension is bounded by $5 \min(D + 2, \frac{(pB)^2 (R^2+1)(A^2+1)}{\gamma^2})$. Finally, for the $L_\epsilon$ loss function (5) can
be rewritten as:

class 1 if:   $y_i - f(x_i) \ge s + \gamma + \epsilon$
or            $f(x_i) - y_i \ge s + \gamma + \epsilon$
class -1 if:  $y_i - f(x_i) \le s - \gamma + \epsilon$
and           $f(x_i) - y_i \le s - \gamma + \epsilon$          (8)

where calling $s' = s + \epsilon$ we can simply repeat the proof above and get the same upper bound on
the $V_\gamma$ dimension as in the case of the $L_1$ loss function. (Notice that the constraint $\gamma \le s \le B - \gamma$
is not taken into account. We believe that taking this into account may slightly change the $V_\gamma$
dimension for $L_\epsilon$.) □

Notice that these results imply that in the case of infinite dimensional RKHS the $V_\gamma$ dimension is
still finite and is influenced only by $5 \frac{(R^2+1)(A^2+1)}{\gamma^2}$. In the next section we present a more refined
upper bound on the $V_\gamma$ dimension in the infinite dimensional case.
4 A more refined computation of the $V_\gamma$ dimension

Below we assume that the data $x$ are bounded so that for any $N \times N$ matrix $G$ with entries
$G_{ij} = K(x_i, x_j)$ (where $K$ is, as mentioned in the previous section, the kernel of the RKHS
considered) the largest eigenvalue of $G$ is always $\le R^2$. Notice that $R^2$ is related to the radius
of the data $x$ in the "feature space" $\{\phi_n(x)\}$ typically considered in the literature [10, ?, 6]. We
also denote by $B$ the upper bound of $L(y, f(x))$.

Theorem 4.1 The $V_\gamma$ dimension for regression using $L_1$ loss function and for hypothesis space
$\mathcal{H}_A = \{f(x) = \sum_{n=1}^{\infty} w_n \phi_n(x) + b \mid \sum_{n=1}^{\infty} \frac{w_n^2}{\lambda_n} \le A^2\}$ is finite for $\forall \gamma > 0$. In particular:

1. If $b$ is constrained to be zero, then $V_\gamma \le \left\lfloor \frac{R^2 A^2}{\gamma^2} \right\rfloor$

2. If $b$ is a free parameter, $V_\gamma \le 4 \left\lfloor \frac{R^2 A^2}{\gamma^2} \right\rfloor$


Proof of part 1.

Suppose we can find $N > \left\lfloor \frac{R^2 A^2}{\gamma^2} \right\rfloor$ points $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ that we can shatter. Let
$s \in [\gamma, B - \gamma]$ be the value of the parameter used to shatter the points.

Consider the following "separation"¹: if $|y_i| < s$, then $(x_i, y_i)$ belongs in class 1. All other points
belong in class -1. For this separation we need:

$|y_i - f(x_i)| \ge s + \gamma$, if $|y_i| < s$
$|y_i - f(x_i)| \le s - \gamma$, if $|y_i| \ge s$          (9)

This means that: for points in class 1 $f$ takes values either $y_i + s + \gamma + \delta_i$ or $y_i - s - \gamma - \delta_i$, for
$\delta_i \ge 0$. For points in the second class $f$ takes values either $y_i + s - \gamma - \delta_i$ or $y_i - s + \gamma + \delta_i$, for
$\delta_i \in [0, (s - \gamma)]$. So (9) can be seen as a system of linear equations:

$$\sum_{n=1}^{\infty} w_n \phi_n(x_i) = t_i \qquad (10)$$

with $t_i$ being $y_i + s + \gamma + \delta_i$, or $y_i - s - \gamma - \delta_i$, or $y_i + s - \gamma - \delta_i$, or $y_i - s + \gamma + \delta_i$, depending
on $i$. We first use Lemma 4.1 to show that for any solution (so the $t_i$ are fixed now) there is another
solution with not larger norm that is of the form $\sum_{i=1}^{N} \alpha_i K(x_i, x)$.

Lemma 4.1 Among all the solutions of a system of equations (10) the solution with the minimum
RKHS norm is of the form: $\sum_{i=1}^{N} \alpha_i K(x_i, x)$ with $\boldsymbol{\alpha} = G^{-1} \mathbf{t}$.

For a proof see the Appendix. Given this lemma, we consider only functions of the form
$\sum_{i=1}^{N} \alpha_i K(x_i, x)$. We show that the function of this form that solves the system of equations
(10) has norm larger than $A^2$. Therefore any other solution has norm larger than $A^2$ which
implies we cannot shatter $N$ points using functions of our hypothesis space.

The solution $\boldsymbol{\alpha} = G^{-1} \mathbf{t}$ needs to satisfy the constraint:

$$\boldsymbol{\alpha}^T G \boldsymbol{\alpha} = \mathbf{t}^T G^{-1} \mathbf{t} \le A^2$$

¹ Notice that this separation might be a "trivial" one in the sense that we may want all the points to be +1 or
all to be -1, i.e. when all $|y_i| < s$ or when all $|y_i| \ge s$ respectively.
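Let $\lambda_{max}$ be the largest eigenvalue of matrix $G$. Then $\mathbf{t}^T G^{-1} \mathbf{t} \ge \frac{\mathbf{t}^T \mathbf{t}}{\lambda_{max}}$. Since $\lambda_{max} \le R^2$,
$\mathbf{t}^T G^{-1} \mathbf{t} \ge \frac{\mathbf{t}^T \mathbf{t}}{R^2}$. Moreover, because of the choice of the separation, $\mathbf{t}^T \mathbf{t} \ge N \gamma^2$ (for example, for the
points in class 1 which contribute to $\mathbf{t}^T \mathbf{t}$ an amount equal to $(y_i + s + \gamma + \delta_i)^2$: $|y_i| < s \Rightarrow y_i + s > 0$,
and since $\gamma + \delta_i \ge \gamma > 0$, then $(y_i + s + \gamma + \delta_i)^2 \ge \gamma^2$. Similarly each of the other points "contributes"
to $\mathbf{t}^T \mathbf{t}$ at least $\gamma^2$, so $\mathbf{t}^T \mathbf{t} \ge N \gamma^2$). So:

$$\mathbf{t}^T G^{-1} \mathbf{t} \ge \frac{N \gamma^2}{R^2} > A^2$$

since we assumed that $N > \frac{R^2 A^2}{\gamma^2}$. This is a contradiction, so we conclude that we cannot get
this particular separation. □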
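
The key step of Part 1 is that every target value $t_i$ has magnitude at least $\gamma$ under the chosen separation. The sketch below computes, for each point, the smallest $|t_i|$ allowed by the separation rules (class 1 when $|y_i| < s$, class -1 otherwise) and the resulting lower bound $\mathbf{t}^T \mathbf{t} / R^2$ on the norm. The function names and the sample values are hypothetical.

```python
def min_abs_target(y, s, gamma):
    """Smallest |t_i| compatible with the separation used in Part 1."""
    if abs(y) < s:
        # Class 1: t_i is y + s + gamma + delta or y - s - gamma - delta, delta >= 0.
        return s + gamma - abs(y)
    # Class -1: t_i ranges over the interval [y - s + gamma, y + s - gamma].
    return abs(y) - s + gamma


def norm_lower_bound(ys, s, gamma, R):
    """Lower bound t^T t / R^2 on the squared norm, using the smallest |t_i|."""
    return sum(min_abs_target(y, s, gamma) ** 2 for y in ys) / R ** 2


ys = [-1.0, -0.3, 0.0, 0.5, 0.9]
s, gamma, R = 0.5, 0.1, 1.0
print(norm_lower_bound(ys, s, gamma, R), len(ys) * gamma ** 2 / R ** 2)
```

The first printed number is never smaller than the second, $N \gamma^2 / R^2$, so once $N$ exceeds $R^2 A^2 / \gamma^2$ no function with norm at most $A^2$ can realize the separation.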

Proof of part 2.

Consider $N$ points that can be shattered. This means that for any separation, for points in the
first class there are $\delta_i \ge 0$ such that $|f(x_i) + b - y_i| = s + \gamma + \delta_i$. For points in the second class
there are $\delta_i \in [0, s - \gamma]$ such that $|f(x_i) + b - y_i| = s - \gamma - \delta_i$. As in the case $b = 0$ we can remove
the absolute values by considering for each class two types of points (we call them type 1 and
type 2). For class 1, type 1 are points for which $f(x_i) = y_i + s + \gamma + \delta_i - b = t_i - b$. Type 2 are
points for which $f(x_i) = y_i - s - \gamma - \delta_i - b = t_i - b$. For class 2 (that is, class -1), type 1 are points
for which $f(x_i) = y_i + s - \gamma - \delta_i - b = t_i - b$. Type 2 are points for which $f(x_i) = y_i - s + \gamma + \delta_i - b = t_i - b$.
Variables $t_i$ are as in the case $b = 0$. Let $S_{11}, S_{12}, S_{-11}, S_{-12}$ denote the four sets of points ($S_{ij}$
are points of class $i$ type $j$). Using Lemma 4.1, we only need to consider functions of the form
$f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)$. The coefficients $\alpha_i$ are given by $\boldsymbol{\alpha} = G^{-1} (\mathbf{t} - \mathbf{b})$ where $\mathbf{b}$ is a vector of
$b$'s. As in the case $b = 0$, the RKHS norm of this function is at least

$$\frac{1}{R^2} (\mathbf{t} - \mathbf{b})^T (\mathbf{t} - \mathbf{b}). \qquad (11)$$

The $b$ that minimizes (11) is $\frac{1}{N} (\sum_{i=1}^{N} t_i)$. So (11) is at least as large as (after replacing $b$ and
doing some simple calculations) $\frac{1}{2 N R^2} \sum_{i,j=1}^{N} (t_i - t_j)^2$.
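
The calculation behind this last step can be spelled out as follows, writing $\bar{t}$ for the minimizing value of $b$, that is, the mean of the $t_i$. This is only a worked version of the algebra; the symbol $\bar{t}$ is introduced here for convenience.

```latex
\sum_{i,j=1}^{N} (t_i - t_j)^2
  = \sum_{i,j=1}^{N} \left( t_i^2 - 2 t_i t_j + t_j^2 \right)
  = 2N \sum_{i=1}^{N} t_i^2 - 2 \Big( \sum_{i=1}^{N} t_i \Big)^2
  = 2N \sum_{i=1}^{N} (t_i - \bar{t})^2 ,
\qquad \bar{t} = \frac{1}{N} \sum_{i=1}^{N} t_i .
```

Dividing by $2 N R^2$ gives $\frac{1}{R^2} \sum_{i=1}^{N} (t_i - \bar{t})^2 = \frac{1}{2 N R^2} \sum_{i,j=1}^{N} (t_i - t_j)^2$, which is the value of (11) at the minimizing $b$.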

We now consider a particular separation. Without loss of generality assume that $y_1 \le y_2 \le \ldots \le y_N$
and that $N$ is even (if odd, consider $N - 1$ points). Consider the separation where
class 1 consists only of the "even" points $\{N, N - 2, \ldots, 2\}$. The following lemma is shown in
the appendix:

Lemma 4.2 For the separation considered, $\sum_{i,j=1}^{N} (t_i - t_j)^2$ is at least as large as $\frac{\gamma^2 (N^2 - 4)}{2}$.

Using Lemma 4.2 we get that the norm of the solution for the considered separation is at least
as large as $\frac{\gamma^2 (N^2 - 4)}{4 N R^2}$. Since this has to be $\le A^2$ we get that $N - \frac{4}{N} \le 4 \left\lfloor \frac{R^2 A^2}{\gamma^2} \right\rfloor$, which completes
the proof (assume $N > 4$ and ignore additive constants less than 1 for simplicity of notation).
□


In the case of $L_p$ loss functions, using the same argument as in the previous section we get
that the $V_\gamma$ dimension in infinite dimensional RKHS is bounded by $\frac{(pB)^2 R^2 A^2}{\gamma^2}$ in the first case of
theorem 4.1, and by $4 \frac{(pB)^2 R^2 A^2}{\gamma^2}$ in the second case of theorem 4.1. Finally for $L_\epsilon$ loss functions
the bound on the $V_\gamma$ dimension is the same as that for $L_1$ loss function, again using the argument
of the previous section.
5 Conclusion

We presented a novel approach for computing the $V_\gamma$ dimension of RKHS for $L_p$ and $L_\epsilon$ loss
functions. We conclude with a few remarks. First notice that in the computations we did not
take into account $\epsilon$ in the case of $L_\epsilon$ loss function. Taking $\epsilon$ into account may lead to better
bounds. For example, considering $|f(x) - y|_\epsilon^p$, $p > 1$ as the loss function, it is clear from the proofs
presented that the $V_\gamma$ dimension is bounded by $\frac{(p(B - \epsilon))^2 R^2 A^2}{\gamma^2}$. However the influence of $\epsilon$ seems to
be minor (given that $\epsilon \ll B$). Furthermore, it may be possible to extend the computations for
more general loss functions.

An interesting observation is that the eigenvalues of the matrix $G$ appear in the computation
of the $V_\gamma$ dimension. In the proofs we took into account only the largest and smallest
eigenvalues. If similar computations are made to compute the number of separations for a given set
of points, then it is possible that all the eigenvalues of $G$ are taken into account. This leads
to interesting relations with the work in [12]. Finally, the bounds on the $V_\gamma$ dimension can be
used to get bounds on the generalization performance of regression machines of the form (4) [1, 4].
Appendix

Proof of Lemma 4.1

We introduce the $N \times \infty$ matrix $A_{in} = \sqrt{\lambda_n} \phi_n(x_i)$ and the new variable $z_n = \frac{w_n}{\sqrt{\lambda_n}}$. We can write
system (10) as follows:

$$A \mathbf{z} = \mathbf{t}. \qquad (12)$$

Notice that the solution of the system of equations (10) with minimum RKHS norm is equivalent
to the Least Squares (LS) solution of equation (12). Let us denote with $\mathbf{z}^0$ the LS solution of system
(12). We have:

$$\mathbf{z}^0 = (A^T A)^+ A^T \mathbf{t} \qquad (13)$$

where $+$ denotes pseudoinverse. To see what this solution looks like we use Singular Value
Decomposition techniques:

$$A = U \Sigma V^T, \qquad A^T = V \Sigma U^T,$$

from which $A^T A = V \Sigma^2 V^T$ and $(A^T A)^+ = V_N \Sigma_N^{-2} V_N^T$, where $\Sigma_N^{-1}$ denotes the $N \times N$ matrix whose
elements are the inverse of the nonzero eigenvalues. After some computations equation (13) can
be written as:

$$\mathbf{z}^0 = V \Sigma_N^{-1} U_N^T \mathbf{t} = (V \Sigma_N U_N^T)(U_N \Sigma_N^{-2} U_N^T) \mathbf{t} = A^T G^{-1} \mathbf{t}. \qquad (14)$$

Using the definition of $\mathbf{z}^0$ we have that

$$\sum_{n=1}^{\infty} w_n^0 \phi_n(x) = \sum_{n=1}^{\infty} \sum_{i=1}^{N} \sqrt{\lambda_n} \phi_n(x) A_{in} \alpha_i. \qquad (15)$$

Finally, using the definition of $A_{in}$ we get:

$$\sum_{n=1}^{\infty} w_n^0 \phi_n(x) = \sum_{i=1}^{N} K(x, x_i) \alpha_i$$

which completes the proof. □
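
The content of the lemma can be illustrated on a tiny finite truncation. The sketch below assumes two equations and three features, so that $G = A A^T$ is a $2 \times 2$ matrix that can be inverted in closed form; the matrix entries and the function name are hypothetical. It forms $\mathbf{z}^0 = A^T G^{-1} \mathbf{t}$ and compares its norm with that of another exact solution obtained by adding a null-space direction of $A$.

```python
def min_norm_solution(A, t):
    """Return z0 = A^T G^{-1} t for a 2 x D matrix A, where G = A A^T."""
    a, b = A
    g11 = sum(x * x for x in a)
    g12 = sum(x * y for x, y in zip(a, b))
    g22 = sum(y * y for y in b)
    det = g11 * g22 - g12 * g12
    # alpha = G^{-1} t, using the closed-form inverse of a 2 x 2 matrix.
    alpha1 = (g22 * t[0] - g12 * t[1]) / det
    alpha2 = (-g12 * t[0] + g11 * t[1]) / det
    # z0 = A^T alpha is a combination of the rows of A.
    return [alpha1 * x + alpha2 * y for x, y in zip(a, b)]


def apply(A, z):
    return [sum(x * zi for x, zi in zip(row, z)) for row in A]


A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
t = [1.0, 2.0]
z0 = min_norm_solution(A, t)

# The cross product of the two rows spans the null space of A (D = 3 only).
null_dir = [-1.0, -1.0, 1.0]
z_other = [zi + 0.5 * ni for zi, ni in zip(z0, null_dir)]

print(apply(A, z0), sum(zi ** 2 for zi in z0))
print(apply(A, z_other), sum(zi ** 2 for zi in z_other))
```

Both vectors solve $A \mathbf{z} = \mathbf{t}$, but the one built from $G^{-1} \mathbf{t}$ has the smaller norm, because it is orthogonal to every null-space direction of $A$.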

Proof of Lemma 4.2 

Consider a point $(x_i, y_i)$ in $S_{11}$ and a point $(x_j, y_j)$ in $S_{-11}$ such that $y_i \ge y_j$ (if such a pair
does not exist we can consider another pair from the cases listed below). For these points
$(t_i - t_j)^2 = (y_i + s + \gamma + \delta_i - y_j - s + \gamma + \delta_j)^2 = ((y_i - y_j) + 2\gamma + \delta_i + \delta_j)^2 \ge 4\gamma^2$. In a similar
way (taking into account the constraints on the $\delta_i$'s and on $s$) the inequality $(t_i - t_j)^2 \ge 4\gamma^2$ can
be shown to hold in the following two cases:

$(x_i, y_i) \in S_{11}$, $(x_j, y_j) \in S_{-11} \cup S_{-12}$, $y_i \ge y_j$
$(x_i, y_i) \in S_{12}$, $(x_j, y_j) \in S_{-11} \cup S_{-12}$, $y_i \le y_j$          (16)

Moreover, clearly

$$\sum_{i,j=1}^{N} (t_i - t_j)^2 \ge 2 \left[ \sum_{i \in S_{11}} \left( \sum_{j \in S_{-11} \cup S_{-12},\, y_i \ge y_j} (t_i - t_j)^2 \right) + \sum_{i \in S_{12}} \left( \sum_{j \in S_{-11} \cup S_{-12},\, y_i \le y_j} (t_i - t_j)^2 \right) \right]$$

Using the fact that for the cases considered $(t_i - t_j)^2 \ge 4\gamma^2$, the right side is at least

$8\gamma^2 \sum_{i \in S_{11}}$ (number of points $j$ in class -1 with $y_i \ge y_j$)
$+ \, 8\gamma^2 \sum_{i \in S_{12}}$ (number of points $j$ in class -1 with $y_i \le y_j$)          (17)

Let $I_1$ and $I_2$ be the cardinalities of $S_{11}$ and $S_{12}$ respectively. Because of the choice of the
separation it is clear that (17) is at least

$$8\gamma^2 \left( (1 + 2 + \ldots + I_1) + (1 + 2 + \ldots + (I_2 - 1)) \right)$$

(for example if $I_1 = 2$ in the worst case points 2 and 4 are in $S_{11}$ in which case the first part of
(17) is exactly 1+2). Finally, since $I_1 + I_2 = \frac{N}{2}$, (17) is at least $8\gamma^2 \frac{N^2 - 4}{16} = \frac{\gamma^2 (N^2 - 4)}{2}$, which proves
the lemma. □
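
The final arithmetic step of the lemma can be checked by brute force. The sketch below minimizes $(1 + \ldots + I_1) + (1 + \ldots + (I_2 - 1))$ over all splits with $I_1 + I_2 = N/2$ and compares $8$ times that minimum with $(N^2 - 4)/2$; the common factor $\gamma^2$ is dropped from both sides. It checks only this last inequality, not the counting argument leading to (17), and the function names are hypothetical.

```python
def triangular(k):
    """1 + 2 + ... + k, taken to be zero when k <= 0."""
    return k * (k + 1) // 2 if k > 0 else 0


def worst_case_count(n_points):
    """Minimum of (1+...+I1) + (1+...+(I2-1)) over I1 + I2 = n_points / 2."""
    half = n_points // 2
    return min(triangular(i1) + triangular(half - i1 - 1) for i1 in range(half + 1))


for n in range(4, 21, 2):
    # Left: 8 * (worst-case count); right: (N^2 - 4) / 2.
    print(n, 8 * worst_case_count(n), (n * n - 4) / 2)
```

In every row the left value is at least the right one, with equality when $N/2$ is odd.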